##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 45 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 22 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 240 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 229 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
The dataset contains 4898 observations of 12 variables (the X column is only a row index): 11 input variables from physicochemical tests that affect wine quality, and 1 output variable, the quality of the wine. The input variables are: fixed acidity (tartaric acid, g / dm^3), volatile acidity (acetic acid, g / dm^3), citric acid (g / dm^3), residual sugar (g / dm^3), chlorides (sodium chloride, g / dm^3), free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3), density (g / cm^3), pH, sulphates (potassium sulphate, g / dm^3), and alcohol (% by volume). The output variable, based on sensory data, is a quality score between 0 and 10. At least 3 wine experts rated the quality of each wine, with 0 as the lowest rating and 10 as the highest.
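As a quick structural check of the variable counts described above, the sketch below rebuilds only the six rows shown in the `head()` output and drops the `X` index column. The names `wine_head`, `wine_vars`, and `inputs` are hypothetical and used for illustration only; the full analysis would run on all 4898 rows.

```r
# First six rows, copied from the head() output above (not the full dataset).
wine_head <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density              = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality              = c(6, 6, 6, 6, 6, 6)
)

# X is only a row index, so it is dropped before counting variables.
wine_vars <- wine_head[, setdiff(names(wine_head), "X")]

# Everything except the output variable counts as an input.
inputs <- setdiff(names(wine_vars), "quality")

ncol(wine_vars)  # number of variables without the index column
length(inputs)   # number of input variables
```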
The quality of the white wine is what ultimately matters most; the inputs combine to produce a particular taste, which is what the quality score reflects. Most wines fall in quality categories 5, 6, and 7. A small number fall in categories 3, 4, 8, and 9, and none fall in categories 0, 1, 2, or 10.
In the univariate plot section, I plotted a histogram for every variable in the dataset and studied its distribution. Most of the variables are close to normally distributed. Alcohol and residual sugar exhibit distributions skewed to the right. Volatile acidity and citric acid show some irregularities.
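One rough way to cross-check the right skew noted for residual sugar and alcohol is to compare the mean with the median, since a long right tail pulls the mean upward. The sketch below uses the figures from the `summary()` output above, assuming the summaries are printed in column order. The name `skew_check` is hypothetical, and this is a heuristic rather than a formal skewness test.

```r
# Median and mean copied from the summary() output above
# (assumption: summaries appear in the same order as the columns).
skew_check <- data.frame(
  variable = c("residual.sugar", "alcohol"),
  median   = c(5.200, 10.40),
  mean     = c(6.391, 10.51)
)

# A positive gap (mean above median) is consistent with a right tail.
skew_check$gap <- skew_check$mean - skew_check$median
skew_check$mean_above_median <- skew_check$gap > 0

skew_check
```

The gap is much larger for residual sugar than for alcohol, so the skew in alcohol is comparatively mild by this measure.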
I did not create any new variables here.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
When plotting histograms of the attribute variables, I used xlim together with the 99th or 95th percentile (from the quantile function) as the upper x-axis limit. This keeps outliers out of the plot and gives a clearer view of each variable's distribution.
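The trimming described above can be sketched as follows. This is a minimal illustration on synthetic right-skewed data, not the original plotting code: the vector `x`, the seed, the bin width, and the lognormal parameters are all assumptions. It also shows where warnings such as "Removed 45 rows containing non-finite values (stat_bin)" come from, because rows beyond the x-axis limit are dropped before binning.

```r
set.seed(42)

# Synthetic right-skewed values standing in for a column such as residual.sugar.
x <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Upper x-axis limit at the 99th percentile.
upper <- unname(quantile(x, 0.99))

# Rows above the limit are the ones ggplot2 reports as removed.
n_dropped <- sum(x > upper)
n_dropped

# Draw the trimmed histogram only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) +
    geom_histogram(binwidth = 0.5) +
    xlim(0, upper)
  print(p)
}
```

Setting `binwidth` explicitly also silences the "`stat_bin()` using `bins = 30`" message. Note that `xlim` discards the out-of-range rows, whereas `coord_cartesian(xlim = ...)` only zooms the view and keeps every row in the bin counts.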
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$citric.acid
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03720595 0.01880221
## sample estimates:
## cor
## -0.009209091
### A weak negative correlation exists between quality and residual.sugar
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
### No significant correlation between free.sulfur.dioxide and quality (r = 0.008, p = 0.57)
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$free.sulfur.dioxide
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01985292 0.03615626
## sample estimates:
## cor
## 0.008158067
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
### A weak positive correlation between quality and pH of white wine, with a correlation coefficient of 0.099
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
##
## Pearson's product-moment correlation
##
## data: wht_wine_quality$quality and wht_wine_quality$sulphates
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02571007 0.08156172
## sample estimates:
## cor
## 0.05367788
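The figures in the Pearson tests above follow directly from the sample correlation and the sample size. As a worked check, the sketch below takes the reported values for quality and alcohol (r = 0.4355747, n = 4898) and recomputes the t statistic and the 95 percent confidence interval using the standard formulas: t = r * sqrt(df) / sqrt(1 - r^2), and a Fisher z interval. The variable names are illustrative only.

```r
r  <- 0.4355747  # sample correlation between quality and alcohol (from the output)
n  <- 4898       # number of observations
df <- n - 2      # degrees of freedom reported by cor.test

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)  # compare with t = 33.858 in the output
round(ci, 4)      # compare with 0.4126015 and 0.4579941 in the output
```

With almost 4900 observations, even small correlations such as the one for sulphates (0.054) reach statistical significance, so the size of r is more informative here than the p-value.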
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density +
## fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
## Warning: Removed 5 rows containing non-finite values (stat_bin).